Homework Policies:

You are encouraged to discuss problem sets with your fellow students (and with the Course Instructor, of course), but you must write your own final answers, in your own words. Solutions prepared "in committee" or by copying someone else's paper are not acceptable. This violates Brown's plagiarism standards, and you will not have the benefit of having thought about and worked the problem when you take the examinations.

All answers must be in complete sentences and all graphs must be properly labeled.

For the PDF Version of this assignment: PDF

For the R Markdown Version of this assignment: RMarkdown

Turning the Homework in:

Please turn the homework in through Canvas. You may submit a PDF, HTML, or Word document.

PHP 1511 Assignment Link

PHP 2511 Assignment Link

The Data

This homework will use the following data:

Part 1

  1. The data set hw1a is simulated from a famous example illustrating how variables in multiple regression can jointly predict the response variable, here Y, even though they do not necessarily predict Y very well individually. There are two predictor variables in the data set, X1 and X2.
  a. Open the data set and begin by looking at all possible two-way scatterplots. Comment on the relationships that you observe.


We can see that it is hard to tell whether there is a relationship between X1 and Y. However, there does appear to be a positive linear relationship between X2 and Y. We can also see a strong positive relationship between X1 and X2.
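As a sketch, a scatterplot matrix like the one described can be produced with `pairs()`. Since the actual hw1a file isn't shown here, the code below simulates a stand-in with the same column names and a correlation pattern that mimics the homework output (the real data may differ):

```r
# Simulated stand-in for hw1a (not the real data):
# x1 and x2 are highly correlated; y depends on them with opposite signs.
set.seed(42)
n  <- 1000
x1 <- rnorm(n)
x2 <- 0.9 * x1 + sqrt(0.19) * rnorm(n)
y  <- -2.04 * x1 + 2.27 * x2 + 0.14 * rnorm(n)
hw1a <- data.frame(y = y, x1 = x1, x2 = x2)

# All pairwise (two-way) scatterplots in a single panel:
pairs(hw1a)
```

With the real file, you would load the data first (for example with `read.csv()`) instead of simulating it.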

  b. Next, examine the simple linear regressions of each predictor to explain Y. Comment on whether the predictors seem to relate to Y. What percent of the variability in Y does each predictor explain by itself?
Coefficient estimates for the two simple linear regressions (fit.x1 first, then fit.x2):

| term        | estimate   | p.value   | conf.low   | conf.high |
|-------------|-----------:|----------:|-----------:|----------:|
| (Intercept) | -0.0041133 | 0.6815769 | -0.0237633 | 0.0155367 |
| x1          |  0.0082620 | 0.4105894 | -0.0114188 | 0.0279428 |
| (Intercept) | -0.0032395 | 0.7190647 | -0.0208921 | 0.0144132 |
| x2          |  0.4393635 | 0.0000000 |  0.4217521 | 0.4569748 |

Model-level summaries (fit.x1 first, then fit.x2):

| r.squared | adj.r.squared | sigma     | statistic    | p.value   |
|----------:|--------------:|----------:|-------------:|----------:|
| 0.0000677 | -0.0000323    | 1.0024504 | 0.6771501    | 0.4105894 |
| 0.1930245 | 0.1929438     | 0.9005499 | 2391.4715062 | 0.0000000 |


We can see from fit.x1 that X1 does not have a significant relationship with Y; that model explains less than 1% of the variation in Y. From fit.x2 we can see that X2 has a significant positive relationship with Y; that model explains 19.3% of the variation in Y.
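A sketch of how fit.x1 and fit.x2 could be produced with `lm()` and summarized with the broom package (the data are simulated as a stand-in, since the real hw1a file isn't shown):

```r
library(broom)

# Simulated stand-in for hw1a with the homework's correlation pattern:
set.seed(42)
x1 <- rnorm(1000)
x2 <- 0.9 * x1 + sqrt(0.19) * rnorm(1000)
y  <- -2.04 * x1 + 2.27 * x2 + 0.14 * rnorm(1000)
hw1a <- data.frame(y, x1, x2)

fit.x1 <- lm(y ~ x1, data = hw1a)
fit.x2 <- lm(y ~ x2, data = hw1a)

tidy(fit.x1, conf.int = TRUE)   # x1 slope near zero, not significant
glance(fit.x1)                  # r.squared essentially 0
tidy(fit.x2, conf.int = TRUE)   # x2 slope positive and significant
glance(fit.x2)                  # r.squared roughly 0.19
```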

  c. Now use lm() to build a multiple regression model using both predictor variables X1 and X2. Comment on the fit and the statistical significance of each predictor variable. What percent of the variability in Y is explained by the model now that both predictors are included? Give an explanation for what you think is happening with both predictors in the model.
Coefficient estimates for the multiple regression:

| term        | estimate   | p.value   | conf.low  | conf.high  |
|-------------|-----------:|----------:|----------:|-----------:|
| (Intercept) | -0.0002643 | 0.8542291 | -0.003084 | 0.0025555  |
| x1          | -2.0397378 | 0.0000000 | -2.046208 | -2.0332674 |
| x2          |  2.2674010 | 0.0000000 |  2.260956 | 2.2738461  |

Model-level summary:

| r.squared | adj.r.squared | sigma    | statistic | p.value |
|----------:|--------------:|---------:|----------:|--------:|
| 0.979412  | 0.9794078     | 0.143849 | 237788.1  | 0       |

We can see that when we include X1 and X2 in the model together, x2 remains significant but has a much larger estimated effect at a given level of X1. X1 now also has a significant effect, and it is negative at a given level of X2. Together they explain 97.94% of the variation in Y. A likely explanation is the strong correlation between X1 and X2 seen in the scatterplots: each predictor's effect is masked until the other is held fixed, so controlling for both at the same time accounts for far more of the variation in Y.
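The multiple regression could be fit the same way as the simple models, just with both predictors in the formula (again on a simulated stand-in for hw1a):

```r
library(broom)

# Simulated stand-in for hw1a (not the real data):
set.seed(42)
x1 <- rnorm(1000)
x2 <- 0.9 * x1 + sqrt(0.19) * rnorm(1000)
y  <- -2.04 * x1 + 2.27 * x2 + 0.14 * rnorm(1000)
hw1a <- data.frame(y, x1, x2)

# Both predictors in one model:
fit.both <- lm(y ~ x1 + x2, data = hw1a)
tidy(fit.both, conf.int = TRUE)  # both slopes now large and significant
glance(fit.both)                 # r.squared close to 0.98
```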

  d. Comment on the change in the estimated coefficients from the simple linear regression models compared to those from the multiple regression model. Are the changes qualitative (direction), quantitative (magnitude) or both?

With X1 the change is both qualitative (direction) and quantitative (magnitude): the slope flips from slightly positive to strongly negative. With X2 the change is only quantitative (magnitude): the slope stays positive but grows much larger.

  e. Run the following code to create a 3d scatterplot. Notice that multiple linear regression is now a plane and not just a line. Why do you think X1 and X2 predict Y so well together when they do not alone?

We can see that the fitted multiple regression is a plane rather than a line. Holding X1 fixed, an increase in X2 leads to an increase in Y, while holding X2 fixed, an increase in X1 leads to a decrease in Y. Because X1 and X2 are highly correlated, these opposing effects nearly cancel when either predictor is viewed on its own, which is why neither looks useful individually even though together they explain almost all of the variation in Y.
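The exact chunk the prompt refers to isn't reproduced above. As a hedged alternative, the lattice package (which ships with R) can draw a comparable 3D point cloud on the simulated stand-in data:

```r
library(lattice)

# Simulated stand-in for hw1a (not the real data):
set.seed(42)
x1 <- rnorm(1000)
x2 <- 0.9 * x1 + sqrt(0.19) * rnorm(1000)
y  <- -2.04 * x1 + 2.27 * x2 + 0.14 * rnorm(1000)
hw1a <- data.frame(y, x1, x2)

# 3D scatterplot: the points sit close to the plane fit by
# lm(y ~ x1 + x2), even though neither 2D margin looks strongly linear.
p <- cloud(y ~ x1 * x2, data = hw1a, screen = list(z = 40, x = -60))
print(p)
```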


Part 2

The Data

Data set hw1b contains air pollution data from 41 U.S. cities. Our goal is to try to build a multiple regression model to predict SO2 concentration using the other variables.

| Variable Name | Description |
|---------------|-------------|
| so2           | SO2 air concentration in micrograms per cubic meter. |
| temp          | Average annual temperature in degrees F. |
| empl20        | The number of manufacturing companies with 20 or more workers. |
| pop           | The population in thousands. |
| wind          | The average annual wind speed in miles per hour. |
| precipin      | The average annual precipitation in inches. |
| precipdays    | The average number of days with precipitation per year. |

  1. Load data set hw1b and answer the following questions. Display all useful code and output inline. Do not just display everything R gives you; display the parts that show why you chose the model you did.
  a. Begin by examining univariate summaries of the 7 variables. Do any of the points seem to have extreme values? Comment on whether cities with extreme values also have extremes on one or more other variables.

| Variable   | Minimum | 1st Quartile | Median | Mean  | 3rd Quartile | Maximum |
|------------|--------:|-------------:|-------:|------:|-------------:|--------:|
| so2        | 8       | 13           | 26     | 30.05 | 35           | 110     |
| temp       | 43.5    | 50.6         | 54.6   | 55.76 | 59.3         | 75.5    |
| empl20     | 35      | 181          | 347    | 463.1 | 462          | 3344    |
| pop        | 71      | 299          | 515    | 608.6 | 717          | 3369    |
| wind       | 6       | 8.7          | 9.3    | 9.444 | 10.6         | 12.7    |
| precipin   | 7.05    | 30.96        | 38.74  | 36.77 | 43.11        | 59.8    |
| precipdays | 36      | 103          | 115    | 113.9 | 128          | 166     |

The table above shows that empl20, the number of manufacturing companies with 20 or more workers, seems to have some extreme values: 75% of the cities fall at 462 or below, yet the maximum is 3344. Population may also have extreme values, since 75% of the values are at or below 717 but the maximum is 3369. We will evaluate these two further with boxplots.

Row indices flagged as outliers by the two boxplots:

## [1] 11 18 27 29
## [1] 11 18 29

From the graphs above we can see that there do appear to be some extreme values for population and for the number of companies with 20 or more employees. Record 11 is the largest of these values.
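As a sketch, `boxplot.stats()` is one way to recover the flagged row indices. Which variable produced which index set above isn't labeled in the output, so the example below uses a small made-up vector echoing the empl20 summary row, not the real data:

```r
# Hypothetical values echoing the empl20 summary table (not the real data):
empl20 <- c(35, 181, 347, 462, 3344)
summary(empl20)

# boxplot.stats() flags points beyond 1.5 * IQR from the hinges,
# the same points a boxplot draws as individual dots:
which(empl20 %in% boxplot.stats(empl20)$out)  # returns 5, the extreme city
```

With the loaded data frame, the same call on `hw1b$empl20` and `hw1b$pop` would give the indices shown above.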

  b. Start your model building by looking at simple linear regressions for each of the 6 predictor variables. Display and examine relevant plots. Summarize the simple linear regression results using the broom package. Note that you can combine tidy statements:

Slope estimates from the six simple linear regressions:

| term       | estimate   | p.value   | conf.low   | conf.high |
|------------|-----------:|----------:|-----------:|----------:|
| temp       | -1.4081325 | 0.0046245 | -2.3559546 | -0.4603104 |
| empl20     |  0.0268587 | 0.0000054 |  0.0165457 | 0.0371718 |
| pop        |  0.0200136 | 0.0010350 |  0.0085979 | 0.0314293 |
| wind       |  1.5557412 | 0.5559350 | -3.7417772 | 6.8532595 |
| precipin   |  0.1082620 | 0.7360031 | -0.5366161 | 0.7531401 |
| precipdays |  0.3272603 | 0.0174044 |  0.0607506 | 0.5937700 |

Model-level summaries, one row per model in the same order:

| r.squared | adj.r.squared | sigma    | statistic  | p.value   |
|----------:|--------------:|---------:|-----------:|----------:|
| 0.1880091 | 0.1671889     | 21.42044 | 9.0300972  | 0.0046245 |
| 0.4157267 | 0.4007453     | 18.17025 | 27.7495857 | 0.0000054 |
| 0.2438183 | 0.2244290     | 20.67121 | 12.5749043 | 0.0010350 |
| 0.0089663 | -0.0164448    | 23.66448 | 0.3528489  | 0.5559350 |
| 0.0029479 | -0.0226176    | 23.73623 | 0.1153071  | 0.7360031 |
| 0.1365773 | 0.1144382     | 22.08842 | 6.1690682  | 0.0174044 |


Temperature has a significant negative slope and explains about 19% of the variation in so2 levels. The number of companies with 20+ employees has a highly significant positive relationship that explains about 42% of the variation. Population has a significant positive relationship that explains about 24% of the variation. Average annual wind speed is not significant and explains about 1% of the variation. Average annual precipitation is not significant and explains less than 1% of the variation. Average annual days of precipitation has a significant positive relationship and explains about 14% of the variation.
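The prompt hints that tidy statements can be combined. One way the six simple regressions could be fit and their slope rows stacked into a single table is sketched below; hw1b is simulated as a stand-in (made-up values, so the real data will give different estimates), and only broom is assumed beyond base R:

```r
library(broom)

# Simulated stand-in for hw1b with the homework's variable names:
set.seed(1)
hw1b <- data.frame(
  so2 = rpois(41, 30), temp = rnorm(41, 56, 7),
  empl20 = rpois(41, 460), pop = rpois(41, 600),
  wind = rnorm(41, 9.4, 1.4), precipin = rnorm(41, 37, 11),
  precipdays = rpois(41, 114)
)

preds <- c("temp", "empl20", "pop", "wind", "precipin", "precipdays")

# One tidy() slope row per simple linear regression, stacked together:
slr <- do.call(rbind, lapply(preds, function(v) {
  fit <- lm(reformulate(v, response = "so2"), data = hw1b)
  td  <- tidy(fit, conf.int = TRUE)
  td[td$term == v, ]
}))
slr
```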

  c. Build a multiple regression model by sequentially adding variables that you feel are important from the simple linear regressions.


From the discussion above, we will begin with the two predictors with the best fit: the number of companies with 20+ employees (empl20) and population (pop).

| term        | estimate   | p.value   | conf.low   | conf.high  |
|-------------|-----------:|----------:|-----------:|-----------:|
| (Intercept) | 26.3250833 | 0.0000000 | 18.5505206 | 34.0996460 |
| empl20      |  0.0824341 | 0.0000020 |  0.0526825 | 0.1121857  |
| pop         | -0.0566066 | 0.0003192 | -0.0855548 | -0.0276584 |

| r.squared | adj.r.squared | sigma    | statistic | p.value |
|----------:|--------------:|---------:|----------:|--------:|
| 0.5863202 | 0.5645476     | 15.48908 | 26.92924  | 1e-07   |

We can see from the model summary that both empl20 and pop have significant effects on so2. empl20 shows a positive relationship given pop, which is actually larger than it was before adjusting for pop. When adjusting for empl20, population actually has a negative relationship with so2. From here I will try adding temperature.

| term        | estimate   | p.value   | conf.low   | conf.high  |
|-------------|-----------:|----------:|-----------:|-----------:|
| (Intercept) | 58.1959320 | 0.0072804 | 16.6835168 | 99.7083472 |
| empl20      |  0.0712252 | 0.0000796 |  0.0386842 | 0.1037661  |
| pop         | -0.0466475 | 0.0043900 | -0.0777939 | -0.0155011 |
| temp        | -0.5871451 | 0.1220314 | -1.3388780 | 0.1645878  |

| r.squared | adj.r.squared | sigma    | statistic | p.value |
|----------:|--------------:|---------:|----------:|--------:|
| 0.6125468 | 0.5811317     | 15.19126 | 19.49847  | 1e-07   |

With the addition of temperature, the \(R^2\) increased to 0.6125 and the adjusted \(R^2\) to 0.5811, a small improvement over the two-predictor model. This suggests that temperature does add something.


I proceeded by checking different models in this fashion. Below I discuss the final model I chose.
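One way such sequential checks could be formalized is a partial F-test via `anova()`, comparing the nested models (sketched on simulated stand-in data; the model names fit2 and fit3 are illustrative, not from the original code):

```r
# Simulated stand-in for hw1b (made-up values; real data will differ):
set.seed(1)
hw1b <- data.frame(
  so2 = rpois(41, 30), temp = rnorm(41, 56, 7),
  empl20 = rpois(41, 460), pop = rpois(41, 600)
)

fit2 <- lm(so2 ~ empl20 + pop, data = hw1b)
fit3 <- lm(so2 ~ empl20 + pop + temp, data = hw1b)

# Partial F-test: does adding temp significantly reduce residual error?
anova(fit2, fit3)
```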

  d. State your final multiple regression model, interpret the parameter estimates and the \(R^2\). Comment on any differences in coefficients between the simple linear regression models and the multiple regression model.

| term        | estimate    | p.value   | conf.low   | conf.high   |
|-------------|------------:|----------:|-----------:|------------:|
| (Intercept) | 100.1524457 | 0.0021815 | 38.6905106 | 161.6143809 |
| empl20      |   0.0648871 | 0.0001881 |  0.0333293 | 0.0964450   |
| pop         |  -0.0393347 | 0.0124987 | -0.0696575 | -0.0090119  |
| precipin    |   0.4194681 | 0.0604983 | -0.0195319 | 0.8584681   |
| wind        |  -3.0823996 | 0.0896221 | -6.6668052 | 0.5020060   |
| temp        |  -1.1212877 | 0.0107070 | -1.9655351 | -0.2770403  |

| r.squared | adj.r.squared | sigma    | statistic | p.value |
|----------:|--------------:|---------:|----------:|--------:|
| 0.6685085 | 0.6211526     | 14.44732 | 14.11668  | 1e-07   |

Your answer may differ from mine, but here is my reasoning for why I chose my model. I chose the model with the predictors pop, empl20, precipin, wind, and temp. This model has an overall \(R^2\) of 0.6685, so it explains about 67% of the variation. If you compare this model to other subsets, you will find that it also has the largest adjusted \(R^2\) among them. We can see that population and wind change in both magnitude and direction of effect. Even though precipin and wind are not significant in this model, including them lets me explain about 6% more variation than leaving them out.
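The final model could be fit as follows, again sketched on simulated stand-in data since hw1b isn't shown (with the real data, the output should match the table above):

```r
library(broom)

# Simulated stand-in for hw1b (made-up values; real data will differ):
set.seed(1)
hw1b <- data.frame(
  so2 = rpois(41, 30), temp = rnorm(41, 56, 7),
  empl20 = rpois(41, 460), pop = rpois(41, 600),
  wind = rnorm(41, 9.4, 1.4), precipin = rnorm(41, 37, 11)
)

fit.final <- lm(so2 ~ empl20 + pop + precipin + wind + temp, data = hw1b)
tidy(fit.final, conf.int = TRUE)
glance(fit.final)  # r.squared, adj.r.squared, overall F statistic
```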

  e. Complete the following with the final model you built in part d.
    1. What does the adjusted \(R^2\) tell you about your model fit? Adjusted \(R^2\) gives us a way to compare models with different numbers of predictors: it adjusts \(R^2\) downward for each variable added, so it only improves when a new variable explains enough additional variation to justify its inclusion.
    2. Perform a hypothesis test on the slope estimates for each variable in the model. Looking at the p-values in the table above, empl20 is significant with a p-value of 0.0002, pop is significant with a p-value of 0.012, and temp is significant with a p-value of 0.011. The remaining predictors, precipin and wind, have p-values above 0.05.
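The penalty in adjusted \(R^2\) can be written out explicitly. With \(n\) observations and \(p\) predictors,

\[ R^2_{\text{adj}} = 1 - \left(1 - R^2\right)\frac{n - 1}{n - p - 1}. \]

As a check against the final model: with \(n = 41\), \(p = 5\), and \(R^2 = 0.6685\), this gives \(1 - 0.3315 \times 40/35 \approx 0.621\), matching the reported adjusted \(R^2\) of 0.6212.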